DOCUMENT RESUME

ED 204 374                                              TM 810 362

AUTHOR       Stenner, A. Jackson; Rohlf, Richard J.
TITLE        Construct Definition Methodology and Generalizability
             Theory Applied to Career Education Measurement.
PUB DATE     [79]
NOTE         24p.
EDRS PRICE   MF01/PC01 Plus Postage.
DESCRIPTORS  *Career Education; Definitions; Grade 9; *Measurement
             Techniques; Secondary Education; *Test Reliability;
             Vocational Maturity
IDENTIFIERS  Career Maturity Inventory (Crites); *Constructs;
             *Generalizability Theory



ABSTRACT
     The merits of generalizability theory in the formulation of
construct definitions and in the determination of reliability
estimates are discussed. The broadened conceptualization of
reliability brought about by Cronbach's generalizability theory is
reviewed. Career Maturity Inventory data from a sample of 60 ninth
grade students are used to demonstrate the power of the technique to
estimate reliability coefficients for a number of differing
measurement procedures. It is concluded that researchers frequently
use reliability coefficients that are inflated estimates of the
precision with which their constructs are being measured.
(Author/GK)


***********************************************************************
*    Reproductions supplied by EDRS are the best that can be made     *
*                    from the original document.                      *
***********************************************************************


Construct Definition Methodology and Generalizability Theory
Applied to Career Education Measurement

A. Jackson Stenner and Richard J. Rohlf

The field of career education measurement is in disarray. Evidence
mounts that today's career education instruments are verbal ability
measures in disguise (see Westbrook's chapter in this volume). A
plethora of trait names such as career maturity, career development,
career planning, career awareness, and career decision making have, in
the last decade, appeared as labels to scales comprised of multiple
choice items. Many of these scales appear to be measuring similar
underlying traits and certainly the labels have a similar sound or
"jingle" to them. Other scale names are attached to clusters of items
that appear to measure different traits and at first glance appear
deserving of their unique trait names, e.g., occupational information,
resources for exploration, work conditions, personal economics. The
items of these scales look different and the labels correspondingly
are dissimilar or have a different "jangle" to them.

As instrument developers and users we commit the "jingle" fallacy
(Green, 1974) when we give the same or nearly the same name to clearly
distinct underlying traits. Similarly, we commit the "jangle" fallacy
when different labels are assigned to essentially the same underlying
trait. When a trait label such as Career Maturity is assigned to a set
of items which in fact measures verbal ability, we have committed the
jangle fallacy. Whenever we find evidence that two similarly named
scales are only moderately correlated, there exists the possibility of
the jingle fallacy.


Whether or not a given scale is a measure of verbal ability as opposed
to career maturity is, of course, a question of validity, i.e., is the
scale actually a measure of "what it is intended to measure," or is it
not? This chapter asserts that the current state of affairs in career
education measurement exists because of the lack of carefully defined
and operationalized career education constructs and will suggest a
theory and methodology that researchers and practitioners will,
hopefully, find useful in their continuing efforts to develop and
refine measurement in the field of career education.


Construct Definition 


Constructs are the means by which science orders observations. We take
it on faith that the universe of our observations can be ordered and
subsequently understood with a comparatively small number of constructs
or inferred organizing influences. Observations are aggregated and
constructs created through the mental processes of abstraction and
induction. When we observe a group of children and describe some of the
children as more aggressive than others, we employ a construct. We
create the construct "aggression" by observing that certain behaviors
tend to vary together, and this pattern of covariation among
observations we come to designate as aggression. In describing the
differences in behavior among children, we might conclude that one
child is much more aggressive than other children. We arrive at this
conclusion informally by summing up the frequency of observed
aggressive acts and we use the total score as an index of each child's
level of aggression. These total scores are then compared and we arrive
at decisions about each child.
This process of weighting individual observations, aggregating the
observations into a total score and then checking the quality of the
construct score by determining how well the total score can predict the
original observations happens so fast and so frequently and works so
well in our everyday lives, that there is seldom need to reflect
critically on the process itself. The search for pattern or regularity
among observations is, it seems, just as central to our daily lives as
it is to scientific activity. Perhaps because the process of
observation, abstraction and construct formation is so fundamental to
daily functioning, it is taken for granted in behavioral science
research. Often observations in the form of questionnaire items and
test questions are aggregated without adequately examining the
assumptions and implications inherent in the summation and averaging
procedures. The simple fact that observations are combined and a total
score computed means that we entertain a hypothesis that the
observations are in some way related to one another. If the
observations are uncorrelated, then combining them into a total score
is a meaningless undertaking, since the total or construct score will
carry no information about the original observations and consequently
will be of no value in explaining anything else. If, however, the
observations are correlated, then the construct score has meaning.
Precisely what meaning depends upon the perceived nature of the
organizing influences responsible for the correlations among
observations.
A construct then is a theory which represents how its inventor
construes a set of interrelated observations. Construct labels (e.g.,
career maturity, occupational information, career decision making)
serve as shorthand expressions for hypotheses about the nature of the
predominant organizing influences responsible for correlations among
observations.

What constitutes an observation? In career education measurement the
most common "observation" would be a person's response to a test or
rating item. Such observations provide information about a person's
placement on a scale and serve as an indicant of the extent to which
the subject possesses the attribute or trait being measured. A set of
such indicants (items) comprises an instrument. The underlying
structure or organizing influence operating on these observations is
often determined by some combination of statistical structural
analysis, e.g., factor analysis, and a logical analysis of the item
content. Corroboration of the underlying structure is then frequently
sought by confirmation via hypothesis testing and correlations with
other instruments measuring conceptually similar and dissimilar
constructs.

An observation, whether made in service of the behavioral or physical
sciences, is prone to error. Error is given more attention in
behavioral sciences measurement probably because it exists in such
abundance. Because of its abundance, the process of construct
definition must incorporate a theory of error. Various approaches to
estimating the reliability of a measurement procedure rest on different
assumptions about error and how it affects the decisions we make.

Classical reliability theory is based on Spearman's model of an
observed score (e.g., observation). Basically, an observed score is a
function of two components, a true score and an error score. Within
this framework, models of reliability have been formulated to assess
the relative importance of each component. Campbell (1976) gives an
excellent review of the historical development of reliability theory.
All traditional measures of reliability (alpha, equivalent forms,
retest) describe the agreement among repeated measurements of the same
individuals. Although these reliability measures differ in their
definition of error, they all assume a single undifferentiated source
of error. Coefficient alpha attributes error to inconsistency in the
extent to which individual items measure an attribute. Measures of
stability, such as test-retest or equivalent forms reliability
coefficients, attribute error to changes in testing conditions, mood of
examinee, etc.


In recent times authors such as Tryon (1957); Cronbach, et al. (1963);
Cronbach, Gleser, Nanda, and Rajaratnam (1972); Nunnally (1967); and
Lord and Novick (1968) have departed from the classic concept of true
vs. error scores and have instead incorporated what has come to be
known as the domain sampling theory of reliability. The notion of a
true score was replaced by a "domain" or "universe" score which is an
individual's score if all observations in a domain or universe could be
averaged. Measurement error in this framework is the extent to which a
sample value differs from the population value.

This change in focus from a "true score" to "universe score" resulted
in increased importance being placed on defining the "universe" from
which a particular sample of items has been drawn and to which we want
to generalize. Initially the concept of universe was restricted to
thinking in terms of a universe of content, for example, a sampling of
reading comprehension items from a universe of possible reading
comprehension items. However, the work of Cronbach, et al. has
broadened this original conceptualization. His work, referred to as
generalizability theory, speaks to sampling of "conditions of
measurement" which include additional sources of variation to that of
just deviation among samples of items, or components of content. This
broadened conceptualization can be viewed as a change from a focus on
the reliability of an instrument to a focus on the reliability of a
measurement procedure.

For example, suppose the career maturity of a group of students is
rated on a number of items by a number of different teachers on several
different occasions. The traditional view of a content domain would
focus on the items as a sample from the universe of all such similar
items. However, Cronbach's generalizability theory forces us to
acknowledge that there are probably systematic differences in item
scores across occasions which do not reflect true change in level of
career maturity; and, to recognize that there are systematic
differences among students that are reflected in observed scores which
are not necessarily due to difference in career maturity, e.g.,
socioeconomic status. Thus, from this perspective we are not only
concerned with a universe of possible career maturity items, but, in
addition, we need to think in terms of a universe of possible teachers
and a universe of possible occasions, and a universe of possible
respondents. Actually Cronbach does not talk in terms of different
universes; but rather, each of the above would be considered a "facet"
in the universe of measurement conditions. The more facets one chooses
to include in defining a construct, the broader the universe of
generalization. Cronbach also refers to facets as either "random" or
"fixed." A fixed facet would be one that would not vary, i.e., would be
a constant in the universe. For example, if raters were considered to
be a fixed facet in a measurement procedure, the investigator would be
planning to always use the same rater(s) whenever a measurement was
taken. Given this condition, there would be no systematic differences
in observed scores due to idiosyncratic differences in rating behavior
among raters. However, if raters were considered to be a "random"
facet, the investigator would be broadening the construct definition of
career maturity such that a person's "universe score" would be an
average score across the universe of career maturity items judged
across all possible raters. In the "fixed" case, the "universe score"
would be an average score across the universe of items as judged by a
particular rater or set of raters.


As discussed above, the process of construct definition begins with the
recognition that observed scores (observations) are determined by some
set of underlying organizing influences. In addition to "wanted"
influences causing variation among scores, we must also recognize that
there are "unwanted" (error) influences exercising potentially biasing
or misleading effects on observed scores. Generalizability theory
enables us to specify these sources of variance in observed scores in
terms of characteristics of the object of measurement, characteristics
of the indicants (items), characteristics of the context of
measurement, and the interactions both within and across those
categories.

In addition to a conceptual model, generalizability theory, using
analysis of variance procedures, provides the techniques by which we
can specify the sources of variance (both wanted and unwanted) in
observed scores and estimate the magnitude of their effects. The
procedure also yields generalizability coefficient(s) which can be
interpreted in a manner similar to traditional reliability
coefficients, e.g., in estimating the standard error of measurement.
However, before these analysis of variance procedures can be applied,
it is necessary to design a study in which sources of variance are
systematically varied.
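
As a minimal sketch of the analysis of variance step for the simplest case, the Python below estimates variance components for a fully crossed persons x items (p x i) design and forms the resulting coefficient. The function names, the dictionary keys, and the small 0/1 score matrix are illustrative assumptions; none of them comes from the G study described in this chapter.

```python
def pxi_variance_components(scores):
    """Estimate variance components for a fully crossed persons x items design.

    scores: list of rows, one row per person, one column per item.
    Returns estimates for persons ("p"), items ("i"), and the
    person-by-item interaction confounded with residual error ("pi,e").
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Sums of squares for the two main effects and the residual.
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Solve the expected mean square equations; negative estimates are set to 0.
    return {
        "p": max((ms_p - ms_res) / n_i, 0.0),
        "i": max((ms_i - ms_res) / n_p, 0.0),
        "pi,e": ms_res,
    }


def g_coefficient_pxi(components, n_items):
    """Wanted variance over wanted plus unwanted variance, items random."""
    true_var = components["p"]
    error_var = components["pi,e"] / n_items
    return true_var / (true_var + error_var)


# Hypothetical data: 5 persons answering 4 items scored 0/1.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
components = pxi_variance_components(scores)
coefficient = g_coefficient_pxi(components, n_items=4)
print(components)
print(round(coefficient, 3))
```

In this design, with items as the only random facet, the coefficient is algebraically identical to coefficient alpha computed on the same score matrix.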

Generalizability theory makes a distinction between G and D studies. A
G study is a study in which data are collected in order to examine a
wide range of sources of variance affecting a measurement procedure
whereas a D study (for Decision) selects either the G study design or
some modification of that design for use in estimating the
generalizability coefficient that can be expected in some subsequent
application of the measurement procedure. A D study does not involve
the gathering of data but rather uses the variance estimates from the
sources designed into the G study to estimate what the generalizability
coefficient would be under alternative construct definitions and
sampling specifications. For example, suppose that the authors of a
career maturity scale employ a p:c x i x occ (persons nested within
class crossed with items crossed with occasion) G study design. That
is, the career maturity scale is administered to several classes on at
least two occasions. Under this


Scenarios #3, 4 and 5 all employ the broadest permissible construct
definition (items and moments random) but the sampling frequencies for
items and/or moments differ. In Scenario #3 we estimate what the
generalizability coefficient would be if 50 items were administered on
one occasion (i.e., moment). Note that the generalizability coefficient
under this construct is coincidentally the same as that observed under
Scenario #1. As a rule, when the construct definition is broadened and
the sampling specifications are unchanged, the generalizability
coefficient goes down. Similarly, when the construct definition is
narrowed and consequently the universe of generalization is narrowed,
the generalizability coefficient is increased. The reasoning for this
outcome is straightforward: if the universe under examination is quite
broad, then a larger number of observations must be sampled to attain a
specified level of precision, whereas a narrower universe permits a
smaller number of observations to attain the same precision. Under
Scenario #4 the item sample remains at N_i = 50 but the number of
testing sessions is increased, N_m = 2, resulting in an improvement in
the generalizability coefficient (ρ² = .81). Finally, under Scenario #5
sampling frequencies are increased for both items (N_i = 100) and
moments (N_m = 3), resulting in a substantial increase in precision of
measurement.
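
The direction of these changes can be sketched numerically. The Python below assumes a simplified persons x items x moments (p x i x m) design with items and moments both random. The variance components are hypothetical values chosen for illustration; they are not the estimates behind the coefficients reported in this chapter, so the printed numbers will not match the scenarios' actual results.

```python
def g_coefficient(var, n_items, n_moments):
    """Relative generalizability coefficient for persons in a p x i x m design.

    Items and moments are both random, so every interaction involving
    persons contributes to error, each divided by its sampling frequency.
    """
    error_var = (var["pi"] / n_items
                 + var["pm"] / n_moments
                 + var["pim,e"] / (n_items * n_moments))
    return var["p"] / (var["p"] + error_var)


# Hypothetical variance components (assumed, not taken from the G study).
components = {"p": 0.040, "pi": 0.030, "pm": 0.020, "pim,e": 0.200}

# Sampling frequencies (items, moments) as described for Scenarios #3 to #5.
scenarios = {"#3": (50, 1), "#4": (50, 2), "#5": (100, 3)}

results = {
    name: g_coefficient(components, n_items, n_moments)
    for name, (n_items, n_moments) in scenarios.items()
}
for name, coefficient in results.items():
    print(f"Scenario {name}: {coefficient:.2f}")
```

With the construct definition held constant, the coefficient rises as either sampling frequency rises, because each error component is divided by the number of conditions sampled from its facet.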

Classical reliability theory, as practiced in the field of career
education measurement, is unnecessarily restrictive. Disciples of
classical theory compute a number of equivalence coefficients by
correlating student performance on split halves of an instrument or by
computing coefficient alpha (KR-20) for an instrument administered on a
single occasion. Similarly, stability or retest coefficients are
computed by correlating student performance on one occasion with
performance, say, two weeks later. Finally, interrater reliability
coefficients are computed by correlating the ratings of two or more
raters of the same behavior. Each of these forms of reliability
coefficient reflects a single undifferentiated source of error and,
more importantly, derives from a different construct definition.
Coefficient alpha accurately reflects an instrument's reliability under
the highly restricted construct definition which treats items as the
only random facet and, consequently, the p x i interaction (confounded
with the residual) as the only source of error. The stability
coefficient properly reflects an instrument's reliability under a
construct definition that treats occasions as the only random source of
variance and the persons x occasion interaction as the only source of
error.


It is important to recognize that traditional forms of reliability
permit error variance to be confounded with true score variance because
they do not differentiate among the many possible sources of error. For
example, the test-retest reliability coefficient will not "break out" a
p x i interaction and thus that variance will be a "hidden" component
of the true score variance. Likewise, the p x occ interaction will be a
hidden component of the true score variance when calculating
coefficient alpha. This can result in an artificially inflated estimate
of true score variance which, in turn, results in an inflated
reliability coefficient. The attractiveness of generalizability theory
is that it permits simultaneous consideration of items, occasions and
other facets as random sources of error.
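
A small numerical sketch may make the "hidden component" argument concrete. The Python below assumes a persons x items x occasions design and compares a single-occasion, alpha-like coefficient (where the p x occ component cannot be separated from person variance) with a coefficient that treats items and occasions as random. The variance components and function names are hypothetical and serve only to illustrate the inflation.

```python
# Hypothetical variance components for a persons x items x occasions design
# (illustrative values only; they are not estimates from this chapter).
components = {"p": 0.040, "pi": 0.030, "po": 0.020, "pio,e": 0.200}


def alpha_like_coefficient(var, n_items):
    """Coefficient from one occasion: p x occ is hidden in 'true' variance."""
    true_var = var["p"] + var["po"]  # occasion effect cannot be separated
    error_var = (var["pi"] + var["pio,e"]) / n_items
    return true_var / (true_var + error_var)


def g_coefficient_items_occasions_random(var, n_items, n_occasions):
    """Coefficient when items and occasions are both random facets."""
    error_var = (var["pi"] / n_items
                 + var["po"] / n_occasions
                 + var["pio,e"] / (n_items * n_occasions))
    return var["p"] / (var["p"] + error_var)


narrow = alpha_like_coefficient(components, n_items=50)
broad = g_coefficient_items_occasions_random(components, n_items=50, n_occasions=1)
print(f"single occasion, items random only: {narrow:.2f}")
print(f"items and occasions random:         {broad:.2f}")
```

Both coefficients rest on the same observed score variance; they differ only in whether the p x occ component is counted as true score variance or as error, which is why the narrower definition yields the larger value.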


Generalizability theory provides a framework that forces us to better
conceptualize the constructs we use. Unfortunately many investigators
do not invest sufficient time in construct definition activities. Time
and again researchers move ahead to answer substantive research
questions without carefully defining the constructs that figure in
their theories or program evaluations. The most common mistake found in
the behavioral sciences is that investigators will state a conceptually
broad construct definition, but will use a reliability estimate that is
based on a much narrower definition, thus yielding a coefficient which
exaggerates the precision with which the construct can be measured.
Elsewhere we have argued that career education will go the way of many
previous fads unless, as a field, it can stake out a set of
well-defined constructs and related instrumentation (Stenner, Strang,
Baker, 1978). So far, efforts in this area have been disappointing.

Some Special Applications

Many seemingly diverse issues in measurement can be accommodated within
generalizability theory. Cronbach (1972) states:

    What appears today to be most important in G Theory
    is not what the book gave greater space to. In 1972
    G Theory appeared as an elaborate technical apparatus.
    Today the machinery looms less large than the questions
    the theory enables us to pose. G Theory has a protean
    quality. The procedures and even the issues take a
    new form in every context. G Theory enables you to ask
    your questions better; what is most significant for you
    cannot be supplied from the outside (p. 199).

In the discussion to follow we attempt to illustrate the range of
applications for which generalizability theory, coupled with construct
definition methodology, can be useful.


Toward a Theory of the Indicant

A historical convention for which we can find no rational explanation
has contributed to avoidance of a potentially fruitful type of
construct validation study. The convention is to report person scores
as number of items (or indicants) correct and item scores as proportion
of respondents answering an item correctly. Thus person scores and item
scores are expressed in different metrics, leading some investigators
to assume that there is some fundamental difference in the way people
and items can be analyzed. For example, construct validity studies
often emphasize relationships between theoretically relevant variables
and the construct under study and use the person as the unit of
analysis. On the other hand, little work has been done in explaining
variance in item scores. Some authors, including ourselves, contend
that many career development scales containing items of the multiple
choice variety in fact measure verbal ability and not career maturity
(see Westbrook's chapter in this volume). One approach to investigating
this contention which, on the face, seems more direct than focusing on
person score correlations, would involve predicting item scores using a
set of item readability and syntax measures as well as theoretically
derived ratings of the extent of career maturity called for by each
item. If the readability and syntax measures explain a large proportion
of the variance in item scores and the theory-based ratings explain
little of the variance, then it is likely that the so-called career
development items are really verbal reasoning items in disguise. If a
construct is really well defined, then it should be possible to explain
the behavior of indicants of that construct, i.e., explain variance in
item or indicant scores. Unfortunately this is a test that few
constructs in the behavioral sciences, let alone career education, have
passed.
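
The comparison proposed above can be sketched in a few lines of Python. The sketch substitutes the simplest possible analysis, a squared Pearson correlation for each predictor taken alone, for the fuller prediction model the text envisions. The eight items, their proportions correct, the readability grade levels, and the maturity ratings are all invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical item-level data (one entry per item, invented for this sketch).
proportion_correct = [0.90, 0.82, 0.75, 0.66, 0.58, 0.47, 0.40, 0.31]
readability_grade = [4.0, 5.1, 5.5, 6.8, 7.2, 8.5, 9.1, 10.3]
maturity_rating = [2, 4, 1, 3, 5, 2, 4, 3]  # theory-based judgment per item

# Proportion of item score variance associated with each predictor alone.
r2_readability = pearson_r(readability_grade, proportion_correct) ** 2
r2_theory = pearson_r(maturity_rating, proportion_correct) ** 2

print(f"readability:       {r2_readability:.2f}")
print(f"theory-based:      {r2_theory:.2f}")
```

A pattern like the one built into these invented numbers, with readability accounting for most of the item score variance and the theory-based ratings for very little, is the outcome the text describes as evidence that the items behave as verbal reasoning items.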


Many outcomes in career education do not readily lend themselves to
paper-pencil testing. For example, outcomes such as employability
skills, personal work habits and job interview behavior are better
measured by trained observers in either real or simulated settings.
Generalizability theory provides a framework for estimating the
dependability of these ratings.

Suppose a career education program sets about to improve the job
interview behavior of a group of students. Five employers from the
local community are called upon to interview each student and complete
a rating scale. One highly informative design for examining the
generalizability of these ratings would be p x i x r (persons crossed
with items crossed with raters). Thus each student would be rated by
each employer on all items. Under this design the broadest permissible
construct definition generalizes over items and raters. Separate
estimates of alpha or interrater agreement would overestimate the
precision with which the construct as defined can be measured. The
generalizability coefficient more accurately reflects our measurement
precision and provides information on how the precision can be
increased to an acceptable level. One excellent illustration of this
type of analysis is provided by Gilmore, Kane and Naccarato (1978).
Note that this design does not include the "occasion" facet. If a
sizeable p x occ interaction exists, our estimate of measurement
precision may be inflated if we have defined our construct to be stable
over time.

Mastery Testing

Some career education programs have objectives which state that all
students will attain a particular mastery level in reading and
mathematics. In assessing this objective, instrumentation is needed
that has a special kind of reliability. Discussion above focused on
developing instruments that would maximally differentiate among objects
(e.g., students). In competency or mastery testing, the objective is to
differentiate among two groups of students, those that have attained
the minimal performance level and those that have not. Generalizability
theory provides a framework for studying the dependability of mastery
or competency decisions. The most thorough treatment of this
application of generalizability theory is provided by Brennan and Kane
(1977).


Generalizability of Class Means

Some career education evaluations use class or school rather than
student as the unit of analysis (Stenner, Strang, Baker, 1978). Brennan
(1975); Kane, Gilmore and Crooks (1976); Haney (1974) and Kane &
Brennan (1977) have suggested that generalizability theory provides a
conceptually and practically appealing approach to estimating the
reliability of class means. The simplest design from which we can
estimate the generalizability of class means is p:c x i (persons nested
within class crossed with items). Note that this is the familiar
persons x items design (from which coefficient alpha is computed) with
the addition that knowledge is available on class membership. Under
this design we can estimate the reliability of persons nested within
classes and class means. In a more complex design such as p:c:s x i,
the object of measurement might be persons (p), classes (c) or schools
(s). In general, this type of split plot design can prove particularly
useful in evaluations in which multiple units of analysis (e.g.,
students, classes, schools) are employed (Hayman, Rayder, Stenner and
Madey, 1979). As a rule, generalizability coefficients should be
computed for each unit of analysis employed in a research or evaluation
study.


In passing we should note that applications of generalizability theory
in which class or school is the object of measurement have focused
exclusively on the mean or first moment of the distribution. Lohnes
(1972), in an excellent but largely ignored paper, demonstrated that
using the variance of a class or school as an independent variable
might also be useful in predicting outcomes. In such studies interest
is centered on differentiating classrooms not in terms of their means,
but rather in terms of their variances while generalizing over
occasions or some other random facet.

Issues of Test Bias

Much attention and controversy has surrounded the issues of race and
sex bias in testing. Although there are many types of bias, perhaps the
most pernicious is that which gives members of particular racial,
ethnic or sex groups unfair advantage in responding to certain kinds of
items. It is somewhat ironical that career education has as one of its
goals eradication of sex role stereotyping (Hoyt, 1975) and yet we were
unable to find any studies of sex bias in career education measurement.


Some forms of item bias can be effectively studied within the framework
of generalizability theory. For example, a simple p:s x i (persons
nested within sex group crossed with items) design will provide
information on possible sex bias. In this design the component of
variance related to sex bias is the s x i (sex by item) interaction. If
this component of variance is large, then items have a different
meaning (i.e., measure something different) for males and females.
Examination of the items contributing most heavily to the interaction
can sometimes lead to explanations for the source of the bias (e.g.,
terminology unfamiliar to males or females). Students can also be
nested within race groups to evaluate racial bias or nested within
reading level groups to evaluate the extent to which item meaning is
conditional on student reading level.

Although the literature on bias has focused almost exclusively on
racial, ethnic and sex characteristics, the notion of bias is a generic
concept. Any characteristic of the object of measurement which
interacts with a random facet represents bias. For social and other
reasons, item scores which are conditional on race and sex (i.e.,
interact with race and sex) have received the bulk of attention. From
theoretical as well as practical perspectives, there are other types of
bias that pose equally troublesome problems. For example, items that
take on radically different meanings depending upon the examinee's
reading level are just as invalid as indicators of career maturity as
items that are conditional on the sex or race of the examinee.



of items, on the design. This scenario could be set up and a
generalizability coefficient estimated given specification of the
object of measurement and construct definition (which facets are
considered fixed and which random).

This application of generalizability theory is analogous to power
analysis (Cohen, 1977) in which different sampling scenarios are
evaluated to determine the probability of detecting an effect. In D
studies different measurement scenarios (alternative construct
definitions coupled with alternative sampling frequencies) are
evaluated to determine the precision with which objects of measurement
can be differentiated.
* 7” 4 


An Illustration of Generalizability Theory

Before proceeding with an example, it may prove useful to reflect on
the meaning of a generalizability coefficient as well as its general
form. A generalizability coefficient is the ratio of true score
variance (universe score variance) to the sum of true score variance
and error variance:

\[
\text{Generalizability coefficient} =
\frac{\text{True score variance}}{\text{True score variance} + \text{Error variance}}
\]

The components that enter true score variance and error variance change
as the construct definition changes, but the basic expression for a
generalizability coefficient remains the same. Following are several
descriptive comments about the generalizability coefficient that may
help in gaining an intuitive understanding of what this ratio means:




• One task of measurement is to differentiate among objects
  (e.g., classrooms or children) on some scale while simultaneously
  generalizing over selected facets. The higher the generalizability
  coefficient, the better the differentiation or separation among
  objects.


• Children or classrooms differ on a scale for many reasons (we usually
  refer to these reasons as sources). Some of these reasons are
  important to us and represent what we want to measure, and others
  are not of interest and represent noise. Differences among students
  that arise due to reasons we are interested in, we call true score
  differences, whereas differences due to reasons we are not interested
  in, we call error differences. A generalizability coefficient is
  simply the ratio of average squared differences between objects
  that arise from wanted sources divided by the average squared
  differences between objects arising from wanted and unwanted sources.
  Observed score variance is the sum of true score variance and error
  variance. Thus the generalizability coefficient represents the
  proportion of observed score variance that is due to "wanted" sources
  of variance. If the generalizability coefficient is high, then a
  high proportion of the variance in observed scores is due to wanted
  sources of variation, whereas if the coefficient is low, it means that
  only a small proportion of differences among objects is due to wanted
  sources of variation.


• We can conceive of the generalizability coefficient as a heuristic
  that describes the confidence with which we can reject the null
  hypothesis that all objects' true scores are equal. Statisticians
  would use an F ratio for this purpose and, in fact, for the simple
  persons x items (p x i) design:


• The generalizability coefficient is the squared correlation between
  observed scores and true scores. The true score is the average score
  we would obtain if all observations across the random facets of the
  universe of generalization could be exhaustively sampled. Errors of
  measurement (unwanted reasons that objects have different scores)
  contribute to ordering people differently on observed scores (which are
  samples) than they would be ordered if their scores could be averaged
  over all facets of interest (e.g., items or days during a two-week
  period). The generalizability coefficient provides an indication of how
  differently people are likely to be ordered if exhaustive sampling of
  all relevant observations was possible.
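
The connection between the F ratio and the coefficient for a persons x items design can be illustrated numerically. The sketch below relies on the standard analysis of variance identity, coefficient = (MS_p - MS_pi) / MS_p = 1 - 1/F, which also equals coefficient alpha. It is offered as an illustration rather than as the authors' own expression; the data matrix SCORES and all function names are hypothetical.

```python
# Hypothetical 5 persons x 4 items matrix of right/wrong scores.
SCORES = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]


def two_way_mean_squares(scores):
    """Return (MS_p, MS_pi) for a persons x items table, one score per cell."""
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pi = ss_total - ss_p - ss_i  # interaction confounded with error
    return ss_p / (n_p - 1), ss_pi / ((n_p - 1) * (n_i - 1))


def coefficient_from_f(scores):
    """Coefficient for the mean over items, written as 1 - 1/F."""
    ms_p, ms_pi = two_way_mean_squares(scores)
    return 1.0 - ms_pi / ms_p


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def coefficient_alpha(scores):
    """Classical coefficient alpha, for comparison with the ANOVA form."""
    n_i = len(scores[0])
    item_vars = [sample_variance([row[j] for row in scores]) for j in range(n_i)]
    total_var = sample_variance([sum(row) for row in scores])
    return n_i / (n_i - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    ms_p, ms_pi = two_way_mean_squares(SCORES)
    print(f"F = {ms_p / ms_pi:.2f}")
    print(f"1 - 1/F = {coefficient_from_f(SCORES):.2f}")
    print(f"alpha   = {coefficient_alpha(SCORES):.2f}")
```

A large F ratio for persons, and therefore strong grounds for rejecting the hypothesis of equal true scores, corresponds to a coefficient near 1. An F ratio near 1 corresponds to a coefficient near 0.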


In summary, the generalizability coefficient provides an estimate of the
precision of measurement given a construct definition. It is meaningless to
refer to a reliability or generalizability coefficient without reference to
the governing construct definition. What construct definition is most
appropriate in a given situation is a substantive question that often cannot be
answered by measurement specialists. What definition to employ is a question
which takes us back to what we want our construct to mean. Ascribing
meaning to constructs and increasing our understanding of variance arising
from applications of our measurement procedures is what the process of
construct definition is all about.


Table 1 provides estimated G study variance components for a p x i x m
design, and Table 2 displays different D study designs or measurement scenarios.
The data used in this illustration was graciously provided by Dr. Bert Westbrook
and represents a subsample of the ninth grade data used in his chapter of this
volume. The sample consists of 60 students responding to the 50 attitude
items of the Career Maturity Inventory (CMI).

Examination of Table 1 reveals that a large proportion (60%) of the variance
on this instrument is unexplained by facets of the measurement procedures. The
second and third largest components of variance are the item (i) and person x
item (p x i) interaction, respectively. The person (p) component explains four
percent of the universe variance. The moment (m), person x moment (p x m)
interaction and item by moment (i x m) interaction explain very small
proportions of the variance.
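
The variance components discussed above can be recovered from the sums of squares and degrees of freedom reported in Table 1. The sketch below assumes a fully crossed random effects p x i x m design with one observation per cell and uses the usual expected mean square equations for that design. This is an assumption about how the components were estimated, and the function and variable names are hypothetical.

```python
# Sums of squares and degrees of freedom as read from Table 1
# (60 persons x 50 items x 2 moments).
N_P, N_I, N_M = 60, 50, 2
SS = {"p": 68.87, "i": 270.41, "m": 1.20, "pi": 661.62,
      "pm": 11.55, "im": 11.87, "pim,e": 423.88}
DF = {"p": 59, "i": 49, "m": 1, "pi": 2891,
      "pm": 59, "im": 49, "pim,e": 2891}


def variance_components():
    """Solve the expected mean square equations for each component."""
    ms = {source: SS[source] / DF[source] for source in SS}
    residual = ms["pim,e"]
    return {
        "p": (ms["p"] - ms["pi"] - ms["pm"] + residual) / (N_I * N_M),
        "i": (ms["i"] - ms["pi"] - ms["im"] + residual) / (N_P * N_M),
        "m": (ms["m"] - ms["pm"] - ms["im"] + residual) / (N_P * N_I),
        "pi": (ms["pi"] - residual) / N_M,
        "pm": (ms["pm"] - residual) / N_I,
        "im": (ms["im"] - residual) / N_P,
        "pim,e": residual,
    }


def proportions(components):
    """Share of the summed variance components due to each source."""
    total = sum(components.values())
    return {source: value / total for source, value in components.items()}


if __name__ == "__main__":
    comps = variance_components()
    props = proportions(comps)
    for source in comps:
        print(f"{source:6s} variance={comps[source]:.5f}  proportion={props[source]:.2f}")
```

Under these assumptions the residual (p x i x m,e) component accounts for about 60 percent of the total, and the person component for about four percent, in agreement with the text.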

A major advantage of generalizability theory is that the theory specifies
which sources of variance are to be ignored, which contribute to true score (τ)
and which contribute to error (δ) in estimating the generalizability of a
measurement procedure under a particular construct definition. Whether stated
or not, there are two essential aspects of a measurement procedure that must
be made explicit before any reliability or generalizability coefficients can
be interpreted. These are (1) the construct definition, i.e., which facets are
to be considered random and which fixed, and (2) the sampling frequencies
for each facet included in the construct definition.


Table 1

Illustrative Example of Generalizability
Analyses for the Attitude Subscale of the CMI

                                                                             Estimated Proportion of
                                                               Estimated     Universe Variance
Source                   Notation        SS       df     MS    Variance      Attributable to Each Source

Person                   p              68.87     59   1.167    .00889        .04
Item                     i             270.41     49   5.519    .04328        .18
Moment                   m               1.20      1   1.204    .00030        .00
Person x Item            p x i         661.62   2891    .229    .04112        .17
Person x Moment          p x m          11.55     59    .196    .00098        .00
Item x Moment            i x m          11.87     49    .242    .00159        .01
Person x Item x Moment   p x i x m,e   423.88   2891    .147    .14662        .60

Table 2 presents construct definitions and sampling specifications for
five scenarios. Scenario #1 displays the generalizability coefficient under
the classical reliability formulation in which moments (i.e., short-term
occasions) are fixed (N_m = 1) and items are random. The generalizability
coefficient (Eρ² = .72) under this scenario accurately describes the precision of
measurement only if our interest centers on how well students can be
differentiated on a single occasion. This coefficient corresponds to the
traditional coefficient alpha (or KR-20).

In scenario #2 moments are random and items are fixed. This construct
definition corresponds to the traditional stability or retest coefficient. In
other words, within the framework of generalizability theory the traditional
retest coefficient may be computed under a generalizability design of the form
p x i x m where items constitute a fixed facet and moments constitute a random
facet. Note that in this case the retest coefficient is higher than the internal
consistency coefficient because the p x m variance component accounts for virtually
no variance whereas the p x i interaction (which contributes to the true score when
items are fixed) accounts for 17% of the universe variance.
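
The scenario #1 coefficient can be reproduced from the Table 1 variance components. The sketch below is a minimal illustration under stated assumptions: the interaction of persons with a fixed facet is added to true score variance, interactions of persons with random facets are treated as error, and each component is divided by the sampling frequencies of the facets it involves. The function name g_coefficient and its arguments are hypothetical, and this convention may not match the authors' computations for every scenario in Table 2.

```python
# Person-related variance components from Table 1 (p x i x m design).
# "pi" is derived from the Table 1 mean squares as (.229 - .147) / 2.
VC = {"p": 0.00889, "pi": 0.0411, "pm": 0.00098, "pim,e": 0.14662}


def g_coefficient(vc, n_items, n_moments, fixed=()):
    """Coefficient for person mean scores over n_items x n_moments.

    `fixed` names the facets ("i" and/or "m") treated as fixed. At most one
    facet may be fixed, because the residual is always counted as error here.
    """
    if "i" in fixed and "m" in fixed:
        raise ValueError("at least one facet must be random")
    true_var = vc["p"]
    error_var = vc["pim,e"] / (n_items * n_moments)
    for facet, n, key in (("i", n_items, "pi"), ("m", n_moments, "pm")):
        term = vc[key] / n
        if facet in fixed:
            true_var += term   # interaction with a fixed facet joins true score
        else:
            error_var += term  # interaction with a random facet is error
    return true_var / (true_var + error_var)


if __name__ == "__main__":
    # Scenario #1: items random (50 items), moments fixed (one occasion).
    coef = g_coefficient(VC, n_items=50, n_moments=1, fixed=("m",))
    print(f"Scenario 1 coefficient: {coef:.2f}")  # prints 0.72
```

Item, moment, and item x moment components do not appear because they do not affect the relative standing of persons when every person responds to the same items on the same occasions.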




Table 2

Illustrative Scenario Table

           Construct Definition
Scenario   Random           Fixed       Sampling           Sources of Variance                       Generalizability
#          Facets           Facets      Specifications     p   i   m   pxi  pxm  mxi  pxixm,e        Coefficient

1          Items            Moments     N_i [illegible]    [illegible]                               .72
                                        N_m = 1
2          Moments          Items       [illegible]        [illegible]                               .78
3          Items, Moments               N_i = 50           τ   -   -   δ    δ    -    δ              .72
                                        N_m = 1
4          Items, Moments               N_i = 50           τ   -   -   δ    δ    -    δ              .81
                                        N_m = 2
5          Items, Moments               N_i = 100          τ   -   -   δ    δ    -    δ              .92
                                        N_m [illegible]


REFERENCES

Brennan, R. L. and Kane, M. T. An Index of Dependability for Mastery Tests.
    Journal of Educational Measurement, 1977, No. 3.

Campbell, J. Psychometric Theory. In M. D. Dunnette (Ed.), Handbook of
    Industrial and Organizational Psychology. Chicago: Rand McNally, 1976.

Cohen, J. Statistical Power Analysis for the Behavioral Sciences. New York:
    Academic Press, 1977.

Crites, John. Career Maturity Inventory. Monterey, California: CTB/McGraw-Hill,
    1973.

Cronbach, L. J., Gleser, G. C., Nanda, H., and Rajaratnam, N. The Dependability
    of Behavioral Measurements: Multifacet Studies of Generalizability. New York.

Cronbach, L. J., Rajaratnam, N., and Gleser, G. C. Theory of Generalizability:
    A Liberalization of Reliability Theory. British Journal of Statistical
    Psychology.

Gillmore, G. M., Kane, M. T., and Naccarato, R. W. The Generalizability of
    Student Ratings of Instruction: Estimation of the Teacher and Course
    Components.

Green, D. R. (Ed.) The Aptitude-Achievement Distinction. Monterey:
    CTB/McGraw-Hill, 19

Haney, W. The Dependability of Group Mean Scores. Unpublished special
    qualifying paper, Harvard Graduate School of Education, October 1974.

Hayman, John, Rayder, Nick, Stenner, A. Jackson, and Madey, Doren L. On
    Aggregation, Generalization, and Utility in Educational Evaluation.
    Educational Evaluation and Policy Analysis, July-August, Vol. 1, No. 4.

Hoyt, Kenneth B. An Introduction to Career Education: A Policy Paper of the
    U.S. Office of Education. Washington, D.C.: U.S. Department of Health,
    Education, and Welfare, 1975.

Kane, M. T. and Brennan, R. L. The Generalizability of Class Means.
    Review of Educational Research, 1977, Vol. 47, No. 1, pp. 207-292.

Kane, M. T., Gillmore, G. M., and Crooks, T. J. Student Evaluations of
    Teaching: The Generalizability of Class Means. Journal of Educational
    Measurement, 1976, 13, 183.

Lohnes, Paul. Statistical Descriptors of School Classes. American Educational
    Research Journal.

Lord, F. M., and Novick, M. R. Statistical Theories of Mental Test Scores.
    Reading, Mass.: Addison-Wesley, 1968.

Nunnally, J. C. Psychometric Theory. New York: McGraw-Hill, 1967.

Stenner, A. Jackson, Strang, Ernest W., and Baker, Robert F. Technical
    Assistance in Evaluating Career Education Projects: Final Report.
    Durham, N.C.: NTS Research Corporation, 1978.

Tryon, R. C. Reliability and Behavior Domain Validity: Reformulation and
    Historical Critique. Psychological Bulletin, 1957, 54, -249.


